SUMMARY OF PROGRAM
PROGRAM  OBJECTIVES 

To  develop  practical,  low  cost,  real-time  methods  for  suppressing  noise 
which  has  been  acoustically  added  to  speech. 

To  demonstrate  that  through  the  incorporation  of  the  noise  suppression 
methods  speech  can  be  effectively  analysed  for  narrow  band  digital 
transmission  in  practical  operating  environments. 

SUMMARY  OF  TASKS  AND  RESULTS 
INTRODUCTION 

In Section II the key research efforts of the program are summarized.


In  Section  IV,  three  recent  technical  papers  are  presented.  Section  V  lists 
the  major  publications  generated  under  this  contract. 
SECTION II

SUPPRESSION  OF  ACOUSTIC  NOISE  IN  SPEECH  USING 
SPECTRAL  SUBTRACTION 

I. INTRODUCTION

Background noise acoustically added to speech can degrade the
performance of digital voice processors used for applications such as speech
compression, recognition, and authentication [1]. The effects of background
noise can be reduced by using noise-cancelling microphones, internal
modification of the voice processor algorithms to explicitly compensate for
signal contamination, or preprocessor noise reduction. Noise-cancelling
microphones, although essential for extremely high noise environments such as
the helicopter cockpit, offer little or no noise reduction above 1 kHz [1].
Techniques for modifying the voice processor to account for noise
contamination are being developed [4]. Preprocessor noise reduction offers
the advantage that noise stripping is done on the waveform itself, with the
output being either digital or analog speech. Thus, existing voice processors
tuned to clean speech can continue to be used unmodified. Also, since the
output is speech, the noise stripping becomes independent of any specific
subsequent speech processor implementation (it could be connected to a CCD
channel vocoder or a digital LPC vocoder).

The objectives of this research were to develop a noise suppression
technique, implement a computationally efficient algorithm, and test its
performance in actual noise environments. The approach used was to estimate
the magnitude frequency spectrum of the underlying clean speech by subtracting
the noise magnitude spectrum from the noisy speech spectrum. The average
noise magnitude was measured during nonspeech activity. The noise suppressor
is implemented using about the same amount of computation as required in an FFT
convolution. It is tested on speech recorded in a helicopter environment.
Its performance is measured using the Diagnostic Rhyme Test (DRT) [6].

II. SIGNAL ESTIMATION USING SPECTRAL SUBTRACTION [3], [4]

The signal x(i) digitised from a single microphone consists of the sum of
speech Sp(i) and ambient acoustic noise n(i). It is assumed that the noise is
locally stationary to the extent that the average value of its spectral magnitude
during speech activity is equal to that measured just prior to speech
activity. Using these assumptions, the spectral subtraction algorithm attempts
to suppress the additive acoustic noise component n(i) from x(i) by the
following steps:

1. Segment the noisy data into windowed analysis blocks of length M
samples, x(i), i = 0, 1, ..., M-1.

2. Compute the N-point DFT X(k) of the data x(i).

3. Estimate the speech spectrum S(k) by subtracting the average noise
spectral magnitude, B(k) = ave|N(k)|, calculated during non-speech activity,
from |X(k)|:

$$S(k) = \bigl[\,|X(k)| - B(k)\,\bigr] \exp\bigl(j\,\mathrm{ARG}[X(k)]\bigr)$$

The motivation behind this approach is to subtract from the noisy
speech spectrum a readily available estimate of the noise spectrum.
The magnitude of N(k) is replaced by its average value, B(k), and the phase of
N(k) is replaced by the phase of X(k).
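
The three steps above can be sketched in Python as follows. This is a minimal illustration and not the report's implementation: the periodic Hann window, the 50 percent overlap-add resynthesis, the frame length, the choice N = M, and the clamping of negative magnitude differences to zero are all assumptions, as are the names spectral_subtract, frame_len, and noise_frames. The noise average B(k) is taken over the first few frames, which are assumed to contain no speech.

```python
import numpy as np

def spectral_subtract(x, frame_len=256, noise_frames=10):
    """Magnitude spectral subtraction with overlap-add resynthesis (sketch)."""
    hop = frame_len // 2
    n = np.arange(frame_len)
    window = 0.5 - 0.5 * np.cos(2 * np.pi * n / frame_len)   # periodic Hann
    starts = range(0, len(x) - frame_len + 1, hop)

    # Steps 1 and 2: windowed blocks and their DFTs X(k) (here N = M).
    spectra = [np.fft.rfft(window * x[s:s + frame_len]) for s in starts]

    # B(k): average noise magnitude over frames assumed to be speech-free.
    B = np.mean([np.abs(X) for X in spectra[:noise_frames]], axis=0)

    y = np.zeros(len(x))
    for s, X in zip(starts, spectra):
        # Step 3: subtract B(k) from |X(k)| and keep the noisy phase.
        # Clamping at zero is an added assumption to keep magnitudes valid.
        mag = np.maximum(np.abs(X) - B, 0.0)
        S = mag * np.exp(1j * np.angle(X))
        y[s:s + frame_len] += np.fft.irfft(S, frame_len)
    return y

# Synthetic check: one second of noise only, then a tone buried in the noise.
rng = np.random.default_rng(0)
fs = 8000
t = np.arange(2 * fs) / fs
noise = 0.1 * rng.standard_normal(t.size)
tone = np.where(t >= 1.0, np.sin(2 * np.pi * 440 * t), 0.0)
x = tone + noise
y = spectral_subtract(x)

in_power = np.mean(x[256:4000] ** 2)      # noise-only stretch, input
out_power = np.mean(y[256:4000] ** 2)     # same stretch after subtraction
tone_power = np.mean(y[9000:15000] ** 2)  # tone region after subtraction
```

In this synthetic case the residual power in the noise-only stretch drops well below the input noise power while the tone is largely retained. Real speech in helicopter noise will behave less cleanly.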
INTELLIGIBILITY AND QUALITY TESTING RESULTS ON
SPECTRAL SUBTRACTION AND LPC-10 [5]

Experiment  Definition 

The  data  base  consisted  of  a  three-speaker  Diagnostic  Rhyme  Test  (DRT) 
list  recorded  in  the  RH-53  helicopter.  This  data  base  was  processed  by  the 
real-time  spectral  subtraction  algorithm  as  implemented  on  the  Utah  FPS-120B 
array  processor.  Audio  tapes  consisting  of  the  original  digital  source  and 
the  spectral  subtraction  output  were  then  sent  to  the  Naval  Research 
Laboratory (NRL). Each of these tapes was processed through NRL's LPC-10
2400 bps real-time bandwidth compression system, generating two more tapes:
original  digital  source  with  LPC,  and  spectral  subtraction  output  with  LPC. 

Finally, these four tapes were sent to Dynastat for intelligibility scoring.

Results

The  total  DRT  score  for  each  tape  is: 

Original Digitized Source            = 85.2
Spectral Subtraction Output          = 79.8
Original Digitized Source with LPC   = 53.9
Spectral Subtraction Output with LPC = 64.5

Discussion 

The results of this experiment clearly show that the intelligibility of
2400 bps LPC coded speech can be significantly increased by preprocessing with
spectral subtraction. These results should be considered a lower bound on
expected performance. In an actual implementation, the intermediate analog
tape recording would be absent. More importantly, the noise suppression
algorithm could be tailored, if necessary, to compensate for known vocoder noise
sensitivities. (This version was not tailored to operate with any specific
vocoder.) Finally, the noise rejection below 1 kHz could be further improved by
use of an improved noise cancellation microphone.
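
As a small worked check of the scores reported above, the differences between tapes can be computed directly. The dictionary keys are shorthand labels introduced here and are not names used in the report.

```python
# Total DRT scores as listed in the Results.
drt = {
    "source": 85.2,
    "subtraction": 79.8,
    "source_lpc": 53.9,
    "subtraction_lpc": 64.5,
}

# Change produced by spectral subtraction ahead of the LPC-10 vocoder.
lpc_gain = drt["subtraction_lpc"] - drt["source_lpc"]

# Change produced by spectral subtraction alone, without the vocoder.
unvocoded_change = drt["subtraction"] - drt["source"]

print(round(lpc_gain, 1), round(unvocoded_change, 1))
```

The preprocessing raises the vocoded score by 10.6 points, while the unvocoded score is 5.4 points lower after subtraction, so the benefit shown by these figures is specific to the LPC-coded case.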
SUPPRESSION OF ACOUSTIC NOISE IN SPEECH
USING TWO MICROPHONE ADAPTIVE NOISE CANCELLATION

INTRODUCTION 

It has been shown that there is a significant reduction in measured
speech intelligibility and quality due to the ambient background noise
generated in many operating environments [1]. A number of single microphone
approaches for reducing the background noise added to speech have been
developed [2]. However, these methods become ineffective when the noise power
is equal to or greater than the signal power or when the noise spectral
characteristics change rapidly in time. This summary describes an alternative
approach to noise suppression in which a second correlated noise source is
adaptively filtered to minimize the power of the difference between the two
microphone signals. Three adaptive algorithm implementations were investigated:
the Widrow-Hoff LMS approach [4], the lattice gradient approach [3], [4], and the
frequency domain short time Fourier transform approach [5]. Each approach was
compared in terms of degree of noise power reduction, algorithm settling time,
and degree of speech enhancement.
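
The first of these approaches can be sketched as a textbook Widrow-Hoff LMS canceller. The code below is an illustration under stated assumptions and not the FORTRAN implementation used in the program: the filter length, step size mu, the three-tap acoustic path, and the sinusoid standing in for speech are all hypothetical, as are the names lms_cancel, num_taps, and mu.

```python
import numpy as np

def lms_cancel(primary, reference, num_taps=8, mu=0.005):
    """Two-microphone noise canceller using the Widrow-Hoff LMS update (sketch)."""
    w = np.zeros(num_taps)
    out = np.zeros(len(primary))
    for i in range(num_taps - 1, len(primary)):
        x = reference[i - num_taps + 1:i + 1][::-1]   # newest reference sample first
        e = primary[i] - w @ x                        # primary minus filtered reference
        w += 2 * mu * e * x                           # gradient step on the output power
        out[i] = e                                    # the error is the enhanced output
    return out, w

# Synthetic check: white reference noise reaches the primary microphone
# through a short (hypothetical) acoustic path and masks a weak sinusoid.
rng = np.random.default_rng(1)
n = 8000
reference = rng.standard_normal(n)
path = np.array([0.6, -0.3, 0.1])
noise_at_primary = np.convolve(reference, path)[:n]
speech = 0.5 * np.sin(2 * np.pi * 0.01 * np.arange(n))
primary = speech + noise_at_primary

out, w = lms_cancel(primary, reference)

tail = slice(6000, n)                                  # after adaptation has settled
residual = np.mean((out[tail] - speech[tail]) ** 2)    # noise left in the output
noise_power = np.mean(noise_at_primary[tail] ** 2)     # noise in the primary input
```

A larger mu shortens the settling time but increases the weight jitter heard as distortion, which is the compromise on step size discussed in the Results.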


RESULTS 

The  performance  of  any  noise  suppression  algorithm  is  ultimately 
determined  by  the  improvement  in  measured  intelligibility  and  quality  due  to 
the  algorithm.  Quantitative  methods  for  measuring  these  improvements  use 
scoring  tests  such  as  the  DRT  [6].  At  the  time  of  this  experiment,  a 
two-microphone  data  base  was  not  available. 

Instead, a controlled data base was used to compare the performance of
these three methods. A stationary white noise source was recorded from an
analog noise generator onto audio tape. The acoustic noise was generated by
playing the audio tape out through a loudspeaker into a hard-walled room.
The reference signal microphone was placed next to the loudspeaker, while the
primary microphone was placed twelve feet away next to the control terminal.
The talker spoke into the primary microphone while controlling the stereo
recording program. The noise power was adjusted to such a level that the
recorded speech was completely masked. The signals were filtered at 3.2 kHz,
sampled at 6.67 kHz, and quantized to fifteen bits. Recordings were made with
and without speech present, each lasting 24.5 sec.

For  each  time  domain  algorithm  a  step  size  was  chosen  such  that  the 
echo induced at the output was barely discernible. Such a choice
represents a compromise between fast adaptation (large step size) and minimal
speech distortion (small step size). Each algorithm then processed the
acoustic data in the absence of speech activity in order to determine
convergence rate versus processing time. Each method reaches a steady state
error of about -15 dB after about 15 seconds. Since the noise was acoustically
added,  no  underlying  clean  speech  spectrum  was  available  for  comparison. 
However,  it  was  judged  that  the  intelligibility  of  the  processed  speech  had 
clearly  improved.  This  was  based  upon  the  fact  that  before  processing  it  was 
difficult  to  even  detect  that  there  was  speech  present  in  the  noise,  while 
after  processing  the  speech  was  understandable. 

In summary, though this two microphone approach to noise suppression
requires a second signal and possibly excessive computation due to long filter
lengths, it offers a potentially powerful approach for speech enhancement in
severe noise environments. Finally, the frequency domain FORTRAN algorithm
ran approximately 3 1/2 times faster than the LMS FORTRAN algorithm,
as predicted from the efficiency provided by the FFT.
TOWARDS A MATHEMATICAL THEORY OF PERCEPTION

James  Kajiya 

ABSTRACT 

A  new  technique  for  the  modelling  of  perceptual  systems  called  formal 
modelling  is  developed.  This  technique  begins  with  qualitative  observations 
about  the  perceptual  system,  the  so-called  perceptual  symmetries,  to  obtain 
through  mathematical  analysis  certain  model  structures  which  may  then  be 
calibrated  by  experiment.  The  analysis  proceeds  in  two  different  ways 
depending  upon  the  choice  of  linear  or  nonlinear  models.  For  the  linear  case, 
the  analysis  proceeds  through  the  methods  of  unitary  representation  theory. 
It  begins  with  a  unitary  group  representation  on  the  image  space  and  produces 
what  we  have  called  the  fundamental  structure  theorem.  For  the  nonlinear 
case,  the  analysis  makes  essential  use  of  infinite-dimensional  manifold 
theory. It begins with a Lie group action on an image manifold and produces
the  fundamental  structure  formula. 

These  techniques  will  be  used  to  study  the  brightness  perception 
mechanism  of  the  human  visual  system.  Several  visual  groups  are  defined  and 
their  corresponding  structures  for  visual  system  models  are  obtained.  A  new 
transform  called  the  Mandala  transform  will  be  deduced  from  a  certain  visual 
group  and  its  implications  for  image  processing  will  be  discussed.  Several 
new  phenomena  of  brightness  perception  will  be  presented.  New  facts  about  the 
Mach  band  illusion  along  with  new  adaptation  phenomena  will  be  presented. 
Also a new visual illusion will be presented. A visual model based on the
above techniques will be presented. It will also be shown how use of
statistical estimation theory can be made in the study of contrast adaptation.
Furthermore, a mathematical interpretation of unconscious inference and a
simple explanation of the Tolhurst effect without mutual channel inhibition
will be given. Finally, image processing algorithms suggested by the model
will be used to process a real-world image for enhancement and for "form" and
texture extraction.




A  CONSTANT  PERCENTAGE  BANDWIDTH  TRANSFORM 
FOR  ACOUSTIC  SIGNAL  PROCESSING 

James E. Youngberg

ABSTRACT 

This  paper  describes  a  constant  percentage  bandwidth  transform  for 
acoustic  signal  processing.  Such  a  transform  is  shown  to  emulate  behavior 
found  in  the  human  auditory  system,  making  possible  both  the  imitation  of 
peripheral  auditory  analysis,  and  processing  which  is  more  closely  linked  to 
perception  than  is  possible  using  constant  bandwidth  analysis. 

To  enable  such  processing,  a  synthesis  transformation  is  developed 
which,  when  cascaded  with  the  analysis  transformation,  provides  an 
analysis-synthesis  identity  in  the  absence  of  spectral  modification.  Various 
properties  of  the  transform  pair  are  derived,  and  a  filterbank  analogy  is  used 
to create a basis for intuitive understanding of the transform's operation and
properties.

The  effects  of  spectral  domain  modification  are  described  and  shown  to 
be  related  to  the  properties  of  the  analysis  window  function. 

Principles  governing  discrete  implementation  of  the  transform  pair  are 
discussed,  and  relationships  are  formalized  which  specify  the  sampling  of  the 
spectral  domain.  These  relationships  are  shown  to  depend  simultaneously  on 
the analysis window function and the selectivity (or Q) of analysis. An
alternative form of the synthesis is given which facilitates a more nearly
optimal logarithmic sampling of the spectral frequency axis. A minimal
sampling pattern is given for the spectral domain which has an overall rate
equivalent to the rate necessary to sample the constant bandwidth spectral
domain.


The  nature  and  computation  of  the  constant-Q  spectral  magnitude  and 
phase  function  is  discussed,  and  three  main  methods  are  evaluated  whereby  the 
spectral  phase  may  be  unwrapped. 

Fine  resolution  constant-Q  spectrograms  are  presented  which  show 
clearly  the  properties  of  constant-Q  analysis  applied  to  speech. 

The  use  of  the  transform  pair  is  discussed  in  the  solution  of  the 
perception-related  problem  of  time  scale  compression  and  expansion  of  speech. 
Results  of  this  experiment  are  discussed. 

Finally, suggestions for further research and applications are
presented.

ACOUSTIC SIGNAL PROCESSING IN THE CONTEXT
OF A PERCEPTUAL MODEL

Tracy  Lind  Petersen 

ABSTRACT 

The  perceptual  analysis  of  acoustic  waveforms  by  the  auditory  system 
involves  both  mechanical  and  neural  transformations  of  the  stimulating  signal. 
Therefore,  a  distinction  exists  between  the  stimulus  space  as  characterized  by 
acoustic vibration, and the auditory perceptual space as characterized by
perceptually  transformed  acoustic  signal  information.  This  dissertation 
explores  acoustic  signal  processing  within  the  domain  of  auditory  perception, 
beginning  with  the  formal  development  of  an  integral  transformation  which  can 
simulate  certain  frequency  selective  properties  of  the  auditory  system. 

A  parameterized  family  of  analysis-synthesis  transform  pairs  which 
behave  as  identities  in  the  absence  of  perceptual  modification  is  developed 
from  a  property  of  homogeneous  functions.  A  particular  member  of  the 
transform  family  is  then  implemented  to  simulate  frequency  selective 
properties  of  the  peripheral  auditory  system.  Frequency  sensitivity  typically 
found  in  fibers  of  the  auditory  nerve  is  also  modeled. 

Following  this,  an  ability  of  the  auditory  brain  to  suppress  the 
perception  of  background  noise  is  simulated,  based  on  a  mathematical  model  of 
loudness perception. This method of noise suppression, called "perceptual
subtraction," is applied to the noise suppression processing of signals
corrupted  by  additive  noise.  The  signal  processing  results  give  empirical 
support  to  a  theory  which  has  been  put  forward  to  explain  loudness  processing 
by  the  brain. 


SPEECH  ARTICULATION  RATE  CHANGE  USING  RECURSIVE 
BANDWIDTH  SCALING 

H.  Ravindra 


ABSTRACT 

Speech articulation rate change is done by analyzing the speech signal
into several frequency channels, scaling the unwrapped phase signal in each
channel, and synthesizing a new speech signal using the modified channel
signals and their scaled center frequencies. It is shown that each channel
signal can be modeled as the simultaneous amplitude and phase modulation of a
carrier, and that scaling only the phase modulating signal does not result in a
proportional scaling of the bandwidth of the channel signals. This
introduces distortions such as frequency aliasing between channels when an
increase in the articulation rate is attempted and reverberation when a rate
reduction is attempted. It is proposed that the amplitude modulating signal
bandwidth should also be scaled, and a recursive method to do this is
discussed.
ESTIMATION OF THE PARAMETERS OF AN AUTOREGRESSIVE PROCESS
IN THE PRESENCE OF ADDITIVE WHITE NOISE

William  J.  Done 

ABSTRACT 

Applications  of  linear  prediction  (LP)  algorithms  have  been  successful 
in  modeling  various  physical  processes.  In  the  area  of  speech  analysis  this 
has  resulted  in  the  development  of  LP  vocoders,  devices  used  in  digital  speech 
communication  systems.  The  LP  algorithms  used  in  speech  and  other  areas  are 
based  on  all-pole  models  for  the  signal  being  considered.  With  white  noise 
excitation  to  the  model,  the  all-pole  LP  model  is  equivalent  to  the 
autoregressive  (AR)  model. 

With the success of this model for speech well established, the
application of LP algorithms in noisy environments is being considered.
Existing LP algorithms perform poorly in these conditions. Additive white
noise severely affects the intelligibility and quality of speech after
analysis by an LP vocoder.

It  is  known  that  the  addition  of  white  noise  to  an  AR  process  produces 
data  that  can  be  described  by  an  autoregressive  moving-average  (ARMA)  model. 
The  AR  coefficients  of  the  ARMA  model  are  identical  to  the  AR  coefficients  of 
the  original  AR  process.  This  dissertation  investigates  the  practicality  of 
this model for estimating the coefficients of the original AR process. The
mathematical details for this model are reviewed. Those for the
autocorrelation method LP algorithm are also discussed.

Experimental results obtained from several parameter estimation
techniques are presented. These methods include the autocorrelation method
for  LP  and  a  Newton-Raphson  algorithm  which  estimates  the  ARMA  parameters  from 
the  noisy  data.  These  estimation  methods  are  applied  to  several  AR  processes 
degraded  by  additive  white  noise.  Results  show  that  using  an  algorithm  based 
on  the  ARMA  model  for  the  data  improves  the  estimates  for  the  original  AR 
coefficients. 
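
The effect described in this abstract can be illustrated with exact autocorrelations of a second-order AR process. The sketch below does not use the Newton-Raphson ARMA estimator of the dissertation; it substitutes the simpler high-order Yule-Walker equations, which rely on the same fact that additive white noise changes only the zero-lag autocorrelation. The coefficients, noise variance, and function names are hypothetical.

```python
import numpy as np

def ar2_autocorrelation(a, max_lag):
    """Exact autocorrelation r(0..max_lag) of a stable AR(2) process, with r(0) = 1."""
    r = np.zeros(max_lag + 1)
    r[0] = 1.0
    r[1] = a[0] / (1.0 - a[1])                 # from r(1) = a1 r(0) + a2 r(1)
    for k in range(2, max_lag + 1):
        r[k] = a[0] * r[k - 1] + a[1] * r[k - 2]
    return r

def yule_walker(r):
    """Autocorrelation-method LP of order 2, using lags 0 to 2."""
    return np.linalg.solve([[r[0], r[1]], [r[1], r[0]]], [r[1], r[2]])

def high_order_yule_walker(r):
    """Order-2 AR estimate from lags 1 to 4 only, which white noise leaves unchanged."""
    return np.linalg.solve([[r[2], r[1]], [r[3], r[2]]], [r[3], r[4]])

a_true = np.array([1.5, -0.7])      # x[n] = 1.5 x[n-1] - 0.7 x[n-2] + e[n]
noise_var = 0.5                     # white noise variance relative to r(0) = 1

r_clean = ar2_autocorrelation(a_true, 4)
r_noisy = r_clean.copy()
r_noisy[0] += noise_var             # additive white noise alters lag 0 only

lp_clean = yule_walker(r_clean)               # recovers a_true
lp_noisy = yule_walker(r_noisy)               # biased by the noise
ho_noisy = high_order_yule_walker(r_noisy)    # recovers a_true despite the noise
```

With measured rather than exact autocorrelations the high-order equations are known to be statistically noisy, which is one reason to estimate the full ARMA model instead.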


APPLICATION OF NONPARAMETRIC RANK-ORDER
STATISTICS TO ROBUST SPEECH ACTIVITY DETECTION

B.  V.  Cox 

ABSTRACT 

This report describes a theoretical and experimental investigation for
detecting the presence of speech in wideband noise. A robust algorithm for
making the silence-voiced-unvoiced decision is described. This algorithm is
based on a nonparametric statistical signal-detection scheme that does not
require a training set of data and maintains a constant false-alarm rate for a
broad class of noise inputs. The nonparametric decision procedure is the
multiple use of the two-sample Savage T statistic. The performance of this
detector is evaluated and compared to that obtained by manually classifying
twenty recorded utterances with 39, 30, 20, 10, and 0 decibel signal-to-noise
ratios. In limited testing, the average probability of misclassification is
less than 6 percent, 12 percent, and 46 percent for signal-to-noise ratios of
39, 20, and 0 decibels, respectively.
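
A two-sample Savage statistic in its standard exponential-scores form can be sketched as below. The abstract does not say how the two samples are formed from the signal or how the several tests are combined into a silence-voiced-unvoiced decision, so the frame magnitudes, the absence of tied values, and the function names here are assumptions made only for illustration.

```python
def savage_scores(n):
    """Savage (exponential) scores a(i) = sum of 1/j for j = n-i+1 .. n, i = 1..n."""
    scores = []
    running = 0.0
    for i in range(1, n + 1):
        running += 1.0 / (n - i + 1)
        scores.append(running)
    return scores

def savage_statistic(reference, test):
    """Sum of the Savage scores of the test sample's ranks in the pooled sample.

    Assumes no tied values. Under the hypothesis that both samples come from
    the same distribution, the expected value equals len(test).
    """
    pooled = sorted(list(reference) + list(test))
    scores = savage_scores(len(pooled))
    rank = {value: i for i, value in enumerate(pooled)}   # 0-based rank of each value
    return sum(scores[rank[x]] for x in test)

# Hypothetical frame magnitudes: a noise-only reference and two test frames.
noise_ref = [0.11, 0.08, 0.15, 0.09, 0.12]
quiet = [0.10, 0.13, 0.07]
loud = [0.90, 1.40, 0.75]

t_quiet = savage_statistic(noise_ref, quiet)
t_loud = savage_statistic(noise_ref, loud)
```

Because the statistic depends only on ranks, its distribution under noise alone does not depend on the noise amplitude distribution, which is what allows a fixed threshold to give a constant false-alarm rate.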
SECTION IV

CRITICAL  BAND  ANALYSIS-SYNTHESIS 

Tracy  L.  Petersen 
Steven  F.  Boll 

ABSTRACT 

The  formal  derivation  of  an  integral  transformation  which  can  simulate 
certain  frequency  selective  (critical  bandwidth)  properties  of  the  auditory 
system  is  given.  A  parameterized  family  of  analysis-synthesis  transform  pairs 
which  behave  as  identities  in  the  absence  of  perceptual  modification  is 
developed from a property of homogeneous functions. The formulation
facilitates a flexible choice of analysis frequencies and frequency selective
response  characteristics.  A  particular  member  of  the  transform  family  is  then 
implemented  to  simulate  frequency  selective  properties  of  the  peripheral 
auditory  system. 

INTRODUCTION 

Motivation 

A  motivation  for  the  work  presented  here  is  based  in  the  distinction 
which  exists  between  signal  representation  in  the  stimulus  domain  as 
characterized  by  acoustical  vibration  and  signal  representation  in  the 
perceptual  domain  as  characterized  by  the  firing  of  neurons  within  portions  of 
the  auditory  system.  The  work  to  be  described  provides  a  signal  processing 
framework  for  modeling  certain  perceptually  significant  properties  of  the 
auditory  system.  It  is  known  that  the  ear  has  bandwidth  sensitivity  which 
increases with frequency. These frequency dependent bandwidths are called
critical bands and their existence has been firmly established [1]. Frequency
selective characteristics of the auditory periphery are modeled through the
design of an integral transform. Formulation of the transform provides a
flexible choice of analysis frequencies and frequency selective response
characteristics. A particular frequency response characteristic is formulated
and implemented to model the prototypical frequency sensitivity of auditory
nerve fibers.


A  PARAMETERIZED  FAMILY  OF 
CONSTANT-Q  TRANSFORMS 


Introduction 

It  is  known  that  auditory  critical  bandwidth  increases  with  frequency. 
Kajiya [2] derived a transform which is "constant-Q" in the sense that each
bandpass filter involved in the transformation has a bandwidth which is a
constant percentage of its center frequency. The transform Q is given by the
ratio of center frequency to bandwidth. This transformation has been
demonstrated as a powerful visual modeling and image processing tool [2], and
also as an acoustic signal processing tool for the time stretching of speech
[3].

The  constant-Q  transform  provides  a  transformation  integral  which  is 
similar in form to that of the short-time Fourier transform. For purposes of
comparison, recall that the short-time Fourier analysis integral is

$$F(w,t) = \int f(T)\, h(t-T)\, \exp(-jwT)\, dT \qquad (1)$$


where  f(t)  is  the  time  signal  to  be  analyzed  and  h(t)  is  the  impulse  response 
of a low-pass function. The constant-Q analysis integral derived by Kajiya is
of the form

$$F(w,t) = \int f(T)\, h[(t-T)w]\, \exp(-jwT)\, dT. \qquad (2)$$

The argument of the low-pass function h of equation 1 has been modified in
equation 2 to have a dependence upon frequency w. Equation 2 may be
interpreted in light of a filterbank analogy. The right side of equation 2
may be rewritten as

$$\exp(-jwt) \int f(T)\, h[(t-T)w]\, \exp[j(t-T)w]\, dT$$

which means

$$F(w,t) = \exp(-jwt)\, \{\, f(t) * [\,h(wt)\exp(jwt)\,] \,\}. \qquad (3)$$

Thus for given w, F(w,t) is seen to be a baseband demodulation of the signal
that results from convolving f(t) with a filter whose impulse response is
h(wt) exp(jwt). Denoting the frequency response of this filter as H(w,v), with v
representing Fourier frequency, and designating F(v) as the Fourier transform
of f(t), allows equation 3 to be written as

$$F(w,t) = \exp(-jwt) \int F(v)\, H(w,v)\, \exp(jvt)\, dv/2\pi \qquad (4)$$

Homogeneous  Function  Formulation 

A function G(x1, x2, ..., xn) is called homogeneous of degree p if for all
real c>0,

$$G(cx_1, cx_2, \ldots, cx_n) = c^{p}\, G(x_1, x_2, \ldots, x_n).$$

Let G(w,v) be homogeneous of degree p and a bandpass function over frequency v
with center frequency w. Then it can be shown [4] that the bandpass function G
is constant-Q.
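
A brief sketch of why this holds is given here for orientation; the proof itself is in [4]. The half-power edge symbols x_1 and x_2 are introduced only for this sketch. Setting c = 1/w in the definition of homogeneity gives

$$G(w,v) = w^{p}\, G(1, v/w),$$

so apart from the gain factor w^p the response depends on v only through the ratio v/w. If the prototype G(1, x) has half-power edges at x = x_1 and x = x_2, then G(w,v) has edges at v = x_1 w and v = x_2 w, and

$$Q = \frac{w}{(x_2 - x_1)\,w} = \frac{1}{x_2 - x_1},$$

which does not depend on the center frequency w.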

Analysis-Synthesis  Derivation 

In  the  constant-Q  analysis  integral  of  equation  3,  the  constant-Q 
filter function is represented in the time domain as an impulse response
h(wt).  In  order  to  achieve  the  design  flexibility  which  would  allow  the 
modeling  of  prototypical  auditory  filter  characteristics,  the  frequency  domain 
representation  of  equation  4  will  be  taken  as  a  starting  point.  The  relative 
frequency  spacing  of  simulated  auditory  filters  is  developed  in  terms  of 
position frequencies which are indicated as functions of w. For parameter p
and function R to be defined below, the bandpass frequency response H(w,v)
from equation 4 is modified to be of the form

$$H_p[R_p(w), v]$$

for frequency variable v and center or position frequency Rp(w). With these
modifications, equation 4 becomes

$$F[R_p(w),t] = \exp[-jR_p(w)t] \int F(v)\, H_p[R_p(w),v]\, \exp(jvt)\, dv/2\pi \qquad (5)$$

where  F(v)  is  the  Fourier  transform  of  the  input  signal  f(t),  and  F[Rp(w),t] 
is the constant-Q transform of f(t) evaluated at frequency Rp(w) and time t.

We now determine functions Rp, Hp such that f(t) is recoverable from
F[Rp(w),t]. The following lemmas and theorem are stated without proof. Their
proofs can be found in [4].

LEMMA 1.

Suppose

$$R_p(w) = \begin{cases} \exp(w), & p = 0 \\ w^{1/p}, & p > 0 \end{cases}
\qquad \text{and} \qquad
d(p) = \begin{cases} 0, & p > 0 \\ -\infty, & p = 0. \end{cases}$$

Further, suppose Hp[Rp(w),v] has the following properties:

1. Hp[Rp(w),v] = 0 for v<0, for all w.

2. Hp[Rp(w),v] is continuous in v.

3. Hp[Rp(w),v] is homogeneous of degree -p, i.e., for c>0,

$$H_p[cR_p(w), cv] = c^{-p}\, H_p[R_p(w), v].$$

Then for every p≥0 there exists a constant Bp such that

$$I_p(v) = \int_{d(p)}^{\infty} H_p[R_p(w),v]\, dw = \begin{cases} B_p, & v > 0 \\ 0, & v < 0. \end{cases} \qquad (6)$$

LEMMA 2.

If p, Ip, Bp are as above, and if f(t) is a real time signal with Fourier
transform F(v) such that F(0)=0, and Bp is finite, then

$$f(t) = (2/B_p)\, \mathrm{RE}\left[ \int F(v)\, I_p(v)\, \exp(jvt)\, dv/2\pi \right]$$

where RE(x) is the real part of x.

THEOREM.

Let f(t) be a real time signal with Fourier transform F(v) and constant-Q
transform F[Rp(w),t]. Further, assume F(v)=0 at v=0. Then for Bp as in lemma 2,
f(t) is recoverable from F[Rp(w),t] by the transformation

$$f(t) = (2/B_p)\, \mathrm{RE}\left\{ \int_{d(p)}^{\infty} F[R_p(w),t]\, \exp[jR_p(w)t]\, dw \right\} \qquad (7)$$

The synthesis expression of equation 7 shows that all channel signals
are first modulated back to their original position frequencies, after which
all channel signals are integrated or summed. The real part of this sum is
then scaled by the constant 2/Bp to recover f(t).

Critical  Band  Transform 

In  this  section  the  constant-Q  transform  is  first  sampled  in  frequency 
and  then  modified  to  a  form  which  approximates  the  critical  band  filterbank 
properties  of  the  auditory  periphery. 

From lemma 1, property 3), the bandpass function Hp must be homogeneous
of degree -p. A function Hp which satisfies auditory filter characteristics
given by Evans and Wilson [5], conforms to the above conditions for
homogeneity, and has a Q of 6 has been implemented in this work as a modified
form of the beta density function. The parameters a, b in the following
expression for Hp were fixed experimentally to set both the Q and the skirt
slopes of the filter. For Fourier frequency variable v, position frequency
Rp(w), and parameters a, b,

$$H_p[R_p(w),v] = \begin{cases} \dfrac{v^{a}\,\{[(b+a)/a]\,R_p(w) - v\}^{b}}{(b/a)^{b}\, R_p(w)^{a+b+p}}, & 0 < v < [(b+a)/a]\,R_p(w) \\[1ex] 0, & \text{otherwise.} \end{cases} \qquad (8)$$
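
The response of equation 8 and the homogeneity required by lemma 1 can be checked numerically with a short sketch. The values a = b = 50 are hypothetical placeholders, since the experimentally fixed parameters are not given here, and the function name beta_bandpass is introduced only for this example. The expression is evaluated in terms of the ratio v/Rp(w), which is algebraically the same as equation 8 but avoids very large intermediate powers.

```python
import math

def beta_bandpass(R, v, a=50.0, b=50.0, p=0.0):
    """Equation 8: modified beta-density bandpass response at position frequency R."""
    edge = (a + b) / a           # upper band edge in units of R
    x = v / R                    # frequency normalized by the position frequency
    if x <= 0.0 or x >= edge:
        return 0.0               # zero outside 0 < v < edge * R
    # v^a (edge*R - v)^b / ((b/a)^b R^(a+b+p)), rewritten in terms of x
    return R ** (-p) * x ** a * ((edge - x) / (b / a)) ** b

R = 1000.0
peak = beta_bandpass(R, R)                   # response at the position frequency
below = beta_bandpass(R, 0.9 * R)
above = beta_bandpass(R, 1.1 * R)

# Homogeneity of degree -p: H[cR, cv] = c^(-p) H[R, v], checked for p = 1.
c = 3.0
lhs = beta_bandpass(c * R, c * 700.0, p=1.0)
rhs = c ** (-1.0) * beta_bandpass(R, 700.0, p=1.0)
```

For p = 0 the response peaks at the position frequency with value 1, and scaling both arguments by c leaves the filter shape unchanged, which is the constant-Q property.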


In  a  discrete  implementation,  a  finite  set  of  position  frequencies  may 
be determined by evaluating Rp(w) for discrete values of w. Based on the data
of Wever [6], Zwislocki [7] derived a relationship between critical bandwidth
and the density of neurons which connect with sensory cells of the inner ear,
located along the basilar membrane. This relationship suggests that 1300
neurons approximately correspond to an interval of one critical band, and that
critical bands represent uniform distance increments along the basilar
membrane.


Uniform spacing on the basilar membrane corresponds to an exponential
spacing of frequency measured in Hertz [8]. Thus, the position frequency
function

$$R_p(w) = \exp(w)$$

is chosen which, from lemma 1, property 4), gives p=0. Discrete position
frequencies of filters in the constant-Q filterbank are then given by the set

$$R_0(w_i) = \exp(w_i), \qquad i = 1, \ldots, N,$$

where

$$w_i - w_{i-1} = (w_N - w_1)/(N-1).$$
Substituting these discrete values of R0(w) into equation 5 gives

$$F[\exp(w_i),t] = \exp[-j\exp(w_i)t] \int F(v)\, H_0[\exp(w_i),v]\, \exp(jvt)\, dv/2\pi, \qquad i = 1, \ldots, N, \qquad (9)$$

which specifies the constant-Q transform at the N analysis frequencies
exp(wi), i=1,...,N.

For this implementation, total signal bandwidth was limited to 4 kHz.
Position frequencies were initially selected over 50 positions from
exp(w1) = 40 Hz to exp(w50) = 3900 Hz.
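The exponential spacing of the position frequencies can be sketched in Python as follows. The function name `position_frequencies` is hypothetical, and the sketch assumes the 40 Hz and 3900 Hz end points and N = 50 quoted above, with w spaced uniformly between the logarithms of the two end points.

```python
import math


def position_frequencies(f_low_hz, f_high_hz, n):
    """Return n position frequencies exp(w_i) with uniform spacing in w."""
    w_low = math.log(f_low_hz)
    w_high = math.log(f_high_hz)
    step = (w_high - w_low) / (n - 1)    # w_i - w_(i-1) = (w_N - w_1)/(N-1)
    return [math.exp(w_low + i * step) for i in range(n)]


freqs = position_frequencies(40.0, 3900.0, 50)
# Adjacent position frequencies share one constant ratio, which is what
# gives every filter in the bank the same Q.
ratios = [freqs[i + 1] / freqs[i] for i in range(len(freqs) - 1)]
```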

Because  the  Q  of  critical  bandwidth  drops  off  toward  lower  frequencies, 
the  wider  bandwidths  in  this  frequency  region  may  be  achieved  by  summing  small 
groups  of  filters  from  the  constant-Q  bank.  By  interactively  summing  groups 
of  low  frequency  filters  and  measuring  the  resulting  bandwidth,  the  50  filters 
of  the  constant-Q  bank  described  above  were  reduced  to  only  23  filters  which 
closely  conform  to  critical  bandwidths.  The  resulting  critical  band  filterbank 
is  plotted  in  figure  1,  where  filters  1  through  6  have  been  normalized  to  1. 
It  can  be  shown  [4]  that  summing  these  filters  results  in  an  overall  frequency 
response which has a passband ripple of 0.2 dB.

SUMMARY 

Through  the  design  of  transformations  which  relate  acoustic  signals  to 
their  critical  band  representations,  we  create  a  means  for  relating  signal 
modifications  to  perceptual  criteria.  Thus  signal  processing  in  the  critical 
band  domain  may  be  evaluated  in  the  stimulus  domain  through  the  combined 
process  of  reconstruction  and  listening  to  the  processed  signal.  Additional 
work  in  the  processing  of  critical  band  signals  has  been  conducted  by  the 
authors [9] where time-varying modifications to critical band intensities are
performed  to  improve  perceived  signal-to-noise  ratios. 
ACOUSTIC NOISE SUPPRESSION IN THE
CONTEXT  OF  A  PERCEPTUAL  MODEL 


Tracy  L.  Petersen 
Steven  F.  Boll 

ABSTRACT 

An  acoustic  noise  suppression  algorithm  has  been  developed  which 
suppresses  noise  from  speech  by  first  filtering  it  into  a  set  of  signals  which 
approximate  the  loudness  components  perceived  by  the  auditory  system.  These 
signals  are  generated  by  passing  the  input  stimulus  waveform  through  a  filter 
bank  with  frequency  bandwidths  which  approximate  the  ear's  critical 
bandwidths.  The  noise  on  each  signal  is  then  suppressed  using  spectral 
subtraction  techniques  in  a  domain  of  simulated  perception.  This  approach  to 
noise  suppression  retains  the  intelligibility  produced  by  spectral  subtraction 
methods  while  eliminating  the  accompanying  musical  quality. 

INTRODUCTION 

The  work  to  be  described  explores  acoustic  signal  processing  within  the 
domain  of  perception.  Such  an  approach  requires  both  a  knowledge  of  auditory 
system  signal  processing  transformations,  and  adequate  techniques  for  their 
simulation. Given a capability to map acoustic signals into the domain of
perception and process this transformed information to suppress perceived
levels of background noise, the processing must be followed by inverse
transformations which return perceptually processed signals to an acoustic
signal representation. This approach is initiated from a signal processing
framework which is based on a mathematical model of peripheral auditory
frequency analysis. Mathematical formulations for loudness perception and the
selective identification of a tone in noise are implemented to suppress noise
loudness as the simulated function of auditory brain activity. The brain's
ability to concentrate upon signal components while ignoring the loudness of
background noise is described as an operation of selective listening. Each
stage of the mathematical modeling is invertible. Thus it is possible to
estimate processed signal intensities which in theory simulate the perception
of signal loudness without imposing a need upon the brain to invoke the
operation of selective listening in order to suppress the loudness of a
masking background noise.

PERCEPTUAL  SUBTRACTION  OF  NOISE 

Critical  Band  Filtering 

Peripheral auditory analysis of the ear may be likened to a bank of
bandpass filters. The filters which form this auditory filter bank are called
critical bands [1]. In this work we use the critical band analysis-synthesis
method as given in [2, 3]. This method simulates the critical band frequency
analysis of the auditory periphery, while an inversion formula allows the
signal to be reconstructed from its critical band filter bank analysis
representation. Analysis over a 4 kHz bandwidth was performed with a bank of
21 critical band filters.

Auditory  Threshold  and  Masking 

In audition the term "masking" is used to describe the situation where
the loudness of a particular sound partially or completely obscures from
perception a second sound. The masking sound is said to induce a threshold
shift in signal detectability.

It is known that the threshold intensity of a pure unmasked tone varies
as a function of tone frequency. Some workers have suggested [4] that the
frequency dependent threshold shifts outside the minimum threshold region may
be modeled as the result of internal masking which is inherent in the
mechanisms of the auditory system itself. This approach proves to be useful
in modeling loudness perception as discussed in the remaining sections of this
paper.

Loudness  Perception 

It is known that strong compressional mechanisms within the auditory
system transform a stimulus intensity range of roughly twelve orders of
magnitude down to a subjective range of approximately three or four orders.
Stevens has shown [5] that loudness perception tends to be a specific
mathematical function of stimulus intensity. If loudness is designated L, and
stimulus intensity I, then

$$L = b\,I^{\theta} \tag{1}$$

Equation 1 gives the relationship which Stevens called the psychophysical
power law. It shows loudness to be a simple power function of intensity.
Hellman and Zwislocki [6] determined a value of the exponent θ to be 0.27.
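A short Python check, given here only as an illustrative sketch, shows how a power law with exponent 0.27 compresses twelve orders of magnitude of intensity into roughly three orders of loudness. The function name `loudness` and the choice b = 1 are assumptions made for the example.

```python
import math


def loudness(intensity, b=1.0, theta=0.27):
    """Power-law loudness L = b * I**theta (b = 1 is an arbitrary unit choice)."""
    return b * intensity ** theta


intensity_orders = 12
# Orders of magnitude spanned by loudness over a 10**12 intensity range.
loudness_orders = math.log10(loudness(10.0 ** intensity_orders) / loudness(1.0))
```

Since the constant b cancels in the ratio, the result is simply 0.27 x 12 = 3.24 orders, in line with the "three or four orders" quoted above.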

A  Model  for  Selective  Listening 

It is important to note that the critical band is an interval over
which the ear integrates energy. Threshold elevation induced by an external
masking noise is proportional to the noise energy within the critical band
associated with the masking [7]. Zwislocki [4] has formulated an expression
for loudness perception over critical band intervals which puts an additional
interpretation upon the power law described in the previous section.
Zwislocki reasoned that loudness perception could be represented
mathematically in terms of the phenomenon of selective listening which is
implicit in psychophysical masking experiments. Selective listening refers to
the ability of a listener to selectively observe either the loudness of signal
and noise, the loudness of signal, or the loudness of noise when signal and
noise are presented simultaneously. It is the ability of the ear to perform
selective listening tasks that makes possible the measurement of loudness
functions under masking [8]. Zwislocki theorized that selectively listening to
a tone in noise required a subtraction of noise loudness from total loudness
within the domain of perception.

In a masking situation the critical band contains the intensity I of
the signal, and the intensity E of an externally presented masking noise.
Here, as discussed earlier, absolute threshold is modeled as a masked
threshold shift due to an internal masking intensity M. Scharf [9] shows that
the intensity M is 4 dB above the absolute threshold for a tone at critical
band center frequency. According to the power law the summed intensities
produce a total critical band loudness

$$L_t = b\,(I+E+M)^{\theta} \tag{2}$$

where b is a constant which depends on choice of units. To obtain an
expression for the loudness of the signal in noise within the critical band
the selective listening hypothesis is invoked to subtract off loudness due to
the masking intensities. This gives the loudness of the signal Ls to be

$$L_s = b\,[(I+E+M)^{\theta} - (E+M)^{\theta}] \tag{3}$$

At this point it is assumed the brain has performed its selective listening
operation, and in concentrating on the signal, perceives the critical band
loudness Ls.

Input/Output Transformation

What is desired now as a processing goal is a stimulus domain
representation of signal intensity which would induce the perception of
loudness Ls while suppressing the perception of loudness due to the external
masking noise. The following is a derivation of an input/output
characteristic which yields the desired intensity. Ls is first equated with
the loudness that would be produced by some unmasked stimulus of intensity
J. An input/output characteristic is then derived which gives J in terms of
signal intensity I, external noise intensity E, internal masking intensity M,
and psychophysical power exponent θ. Because this stimulus is unmasked, the
expression for its loudness in terms of J has zero external noise intensity,
and by definition

$$L_s = b\,[(J+M)^{\theta} - M^{\theta}]. \tag{4}$$

The equality of equations 3 and 4 then gives

$$b\,[(J+M)^{\theta} - M^{\theta}] = b\,[(I+E+M)^{\theta} - (E+M)^{\theta}]$$

$$(J+M)^{\theta} = [(I+E+M)^{\theta} - (E+M)^{\theta}] + M^{\theta}$$

$$J = [(I+E+M)^{\theta} - (E+M)^{\theta} + M^{\theta}]^{1/\theta} - M. \tag{5}$$

This new signal intensity J is one which in theory stimulates the perception
of signal loudness Ls without imposing a need upon the brain to invoke the
operation of selective listening in order to suppress the loudness of the
external masking noise.
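The input/output characteristic of equation 5 can be explored numerically. The Python sketch below is a minimal illustration; the function name `unmasked_equivalent_intensity` and the intensity values (in arbitrary units) are assumptions chosen for the example, and θ = 0.27 is taken from the exponent quoted earlier.

```python
def unmasked_equivalent_intensity(signal, noise, internal, theta=0.27):
    """Equation 5: intensity J whose unmasked loudness equals the loudness of
    a signal of intensity `signal` heard in external noise `noise`, with
    internal masking intensity `internal`."""
    masked_loudness = (signal + noise + internal) ** theta - (noise + internal) ** theta
    return (masked_loudness + internal ** theta) ** (1.0 / theta) - internal


# With no external noise the characteristic returns the input unchanged.
no_noise = unmasked_equivalent_intensity(100.0, 0.0, 1.0)
# With external noise equal to the signal intensity, J falls well below I.
with_noise = unmasked_equivalent_intensity(100.0, 100.0, 1.0)
# With no signal present, J is zero whatever the noise level.
no_signal = unmasked_equivalent_intensity(0.0, 100.0, 1.0)
```

Because θ < 1 makes the power law concave, J never exceeds I, so the characteristic acts as an attenuation that grows with the external noise intensity.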

SIGNAL  PROCESSING  IMPLEMENTATION 

Critical  Band  Signal  Generation 

The processing of loudness information requires the computation of
intensity for each critical band in the analysis transform filterbank [2, 3].
For this implementation each critical band filter has a real frequency
response which is zero over negative frequencies, and therefore has a complex
time response.

Given a critical band filterbank composed of N filters, the kth
critical band filter operates on a real input signal f(t) to produce a complex
bandpass time signal. The time varying intensity Zk(t) within the kth
critical band is taken as the square of the instantaneous amplitude of the
complex signal.

The  perceptual  subtraction  of  noise  as  represented  by  equation  5 
assumes  that  the  noise  is  stationary  and  that  the  expected  value  of  noise 
intensity  within  each  critical  band  is  known.  Critical  band  noise  intensity 
estimates  were  obtained  by  performing  critical  band  analysis  over  noise  only 
time intervals. For critical band k the expected noise intensity Ek was
determined as a long-time average of the squared instantaneous envelope.
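As a minimal sketch of this estimate, the Python fragment below averages the squared magnitude of complex band samples taken from a noise-only interval. The function names and the three sample values are hypothetical; a real system would average over a much longer interval.

```python
def instantaneous_intensity(sample):
    """Squared instantaneous amplitude of one complex band-signal sample."""
    return sample.real ** 2 + sample.imag ** 2


def expected_noise_intensity(noise_band_samples):
    """Long-time average of the squared envelope over a noise-only interval."""
    total = sum(instantaneous_intensity(s) for s in noise_band_samples)
    return total / len(noise_band_samples)


# Hypothetical complex outputs of one critical band filter during noise only.
noise_samples = [complex(3.0, 4.0), complex(0.0, 5.0), complex(-5.0, 0.0)]
e_k = expected_noise_intensity(noise_samples)
```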

Noise  Suppression 

The critical band intensity Zk(t) is due to both signal and noise.
Given that Jk(t) is the processed intensity at the kth critical band, equation
5 then takes the form

$$J_k(t) = [(Z_k(t)+M_k)^{\theta} - (E_k+M_k)^{\theta} + M_k^{\theta}]^{1/\theta} - M_k \tag{6}$$

Equation 6 defines the process of spectral subtraction in the perceptual
domain as motivated by the simulation of selective listening. Critical band
filterbank analysis is applied to f(t), producing N complex time signals.
Instantaneous intensities Zk(t) are computed and each is processed according
to equation 6 to create a new critical band intensity Jk(t). The appropriate
inverse operations are then performed and the N channels are summed to form
the output speech.
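The per-band processing described above might be sketched in Python as follows. Two steps here are assumptions that the text does not specify: the processed intensity is clamped at zero when the instantaneous intensity falls below the noise estimate, and the processed intensity is imposed on the band signal by rescaling each complex sample while keeping its phase. The function names and sample values are likewise hypothetical.

```python
import math


def processed_intensity(z, e, m, theta=0.27):
    """Equation 6 for one sample of one critical band.

    z: instantaneous intensity, e: expected noise intensity,
    m: internal masking intensity. Clamping at zero is an assumption.
    """
    loudness_term = (z + m) ** theta - (e + m) ** theta + m ** theta
    if loudness_term <= 0.0:
        return 0.0
    return max(loudness_term ** (1.0 / theta) - m, 0.0)


def suppress_band(samples, e, m, theta=0.27):
    """Rescale complex band samples so their intensity becomes J_k(t).

    Keeping the phase and scaling the magnitude is an assumed form of the
    inverse operation, not a step taken from the text.
    """
    output = []
    for s in samples:
        z = s.real ** 2 + s.imag ** 2
        j = processed_intensity(z, e, m, theta)
        gain = math.sqrt(j / z) if z > 0.0 else 0.0
        output.append(gain * s)
    return output


band = [complex(10.0, 0.0), complex(0.0, 1.0)]   # strong and weak samples
out = suppress_band(band, e=4.0, m=1.0)
```

In this example the strong sample is attenuated but retained, while the sample whose intensity lies below the noise estimate is removed entirely.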

SUMMARY  AND  CONCLUSIONS 

The success of this work both follows from and contrasts with work by
others using spectral subtraction [10]. The parallel between the method of
perceptual subtraction and the method of spectral subtraction is that in both
cases noise estimates are locally subtracted out in a transformed signal
space. In the case of spectral subtraction this transformed signal space is
the short-time Fourier spectrum. In the work presented here, this transformed
signal space is the perceptual space of critical band loudness, where
estimated noise loudnesses are subtracted from input signal loudness.

Typically, when noise is suppressed by a time-varying attenuation of
signal frequencies, successful processing requires reasonable signal-to-noise
ratios. In the case of perceptual subtraction, work by Hellman and Zwislocki
[8] formally suggests why this should be so. They observed that their results
parallel results obtained by Miskolczy-Fodor [11] in the measurement of
loudness perception by listeners with a particular hearing loss. They found
that loudness curves with noise induced threshold shifts have essentially the
same form as loudness curves obtained from listeners who suffer sensorineural
hearing loss (recruitment) resulting in higher than normal perception
thresholds. Perceptual subtraction is formulated to pass perceptible critical
band signal loudnesses and suppress critical band components which contribute
only noise. Based on the observation of Hellman and Zwislocki it is possible
to interpret the method of perceptual subtraction processing as a method for
simulating perception deafness to noise. Clearly, the opportunity to improve
time-varying signal-to-noise ratios through dynamic attenuation of signal
frequencies becomes limited as the noise begins to totally overtake the signal
because perceptual subtraction can only preserve signal components which have
perceptible intensities. In simulating perception deafness to noise we
inevitably simulate deafness to signal as well when noise completely dominates
the signal.

For testing purposes, speech signals were additively combined with
broadband white gaussian noise at signal-to-noise ratios from 40 to 0 dB in 10
dB increments. Each sample, so constructed, was then processed for noise
suppression by the method of perceptual subtraction. In evaluating both the
perceptual subtraction and spectral subtraction algorithms, listening tests
revealed that when the signal-to-noise ratio of the input speech becomes less
than 10 dB, the quality of processed speech decreases sharply. In all cases,
however, a dramatic reduction of background noise was observed. A prominent
difference in processing results of this method with spectral subtraction was
found to be in the overall perceived "smoothness" with which noise is
suppressed. Spectral subtraction processing produces speech with a somewhat
harsher quality than that produced by perceptual subtraction. Also, the
spectral subtraction method tends to admit small, but nevertheless sharply
perceived, occurrences of noise residual artifacts. In the case of perceptual
subtraction, any remaining noise artifacts were near audible threshold and
judged generally less offensive.
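The construction of the test material can be sketched in Python as below. This is only an illustration under stated assumptions: a 200 Hz tone stands in for speech, the 8 kHz sampling rate and the one-second length are chosen for the example, and the function names are hypothetical. The noise is scaled so that the ratio of mean signal power to mean noise power meets each target.

```python
import math
import random


def mean_power(x):
    """Mean squared value of a real sequence."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled to give the requested SNR in dB.

    Returns the mixture and the scale factor applied to the noise.
    """
    target_noise_power = mean_power(speech) / 10.0 ** (snr_db / 10.0)
    scale = math.sqrt(target_noise_power / mean_power(noise))
    return [s + scale * n for s, n in zip(speech, noise)], scale


rng = random.Random(0)                        # fixed seed for repeatability
# Stand-in "speech": one second of a 200 Hz tone at an assumed 8 kHz rate.
speech = [math.sin(2.0 * math.pi * 200.0 * n / 8000.0) for n in range(8000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(8000)]

# One mixture per test condition: 40, 30, 20, 10 and 0 dB.
mixtures = {snr: mix_at_snr(speech, noise, snr)[0] for snr in range(40, -1, -10)}
```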
ABSTRACT

Vocoders  based  on  linear  models  of  the  vocal  tract  such  as  LPC  result  in  an 
inverse  polynomial  fit  of  the  spectrum,  and  along  with  homomorphic  vocoders 
require pitch estimation. Analysis/synthesis using B-spline basis functions
is proposed in this preliminary paper. This is a local approximation scheme
which permits a concentration of parameters in regions of greater importance
and employs easily computed least squares coefficients. It can be used with
pitch based vocoders or as a standalone nonpitch based vocoder. An
experimental  system,  not  yet  optimized  using  special  properties  of  human 
speech,  has  been  applied  to  samples  of  male  speech,  female  speech, 
simultaneous  speech  with  two  speakers,  and  noisy  speech.  Empirical  results  on 
tape  will  be  presented. 


INTRODUCTION 

Two frequently used signal analysis/synthesis techniques are linear predictive
coding (5) and homomorphic filtering (7). Usually both of these methods result
in an approximation to the spectrum that has uniform characteristics over the
whole spectrum. Using the method proposed below in conjunction with pitch
extraction can lead to a pitch based vocoder with a tightness of fit that
varies with the frequency. Such a fit can be done to some extent using
"selective" LPC (6), but it results in piecewise inverse polynomials over the
different frequency spans which will probably not match at the cut frequency
to provide a continuous function, much less have any derivative continuity. In
fact, such a fit would be unusual. In the proposed method continuity and a
somewhat variable degree of derivative continuity can be ensured, but it
affects the number of parameters required.


Homomorphic filtering was developed as a way to [...]; it can be used to
determine pitch as well as vocal tract impulse response. The method in its
full generality, however, involves the computation of the complex cepstrum, a
difficult problem since the phase information is not in convenient form. The
spline vocoder used on the power spectrum yields a model in which the
variation of the i-th parameter is based on changes in energy in the signal
over the band of frequencies from w_i to w_{i+k}. If the changes are not
abrupt, then the parameters should change in a nonabrupt manner also.

[...] models are based on modelling the signal as the output of a single vocal
tract, so they are less robust in situations where the hypotheses are not
applicable, as in the presence of multiple speakers or a noisy environment.
The spline method may be used differently in this situation. Matching can be
done on the real and imaginary parts of the Fourier transform of the windowed
speech, and a synthetic speech waveform generated. An initial attempt using
this method with many parameters has indicated that it merits further
attention. Since the phase information is implicitly calculated, the problem
of unwrapping the raw data is not present, and since pitch extraction is
unnecessary, the quality of the speech is impervious to the pitch of the
speaker (equally well for female as well as male), the number of
speakers, or the presence of noise.

B-SPLINES 

The computations involved become feasible because of the characteristics of
splines in general and B-splines in particular (1,2).

Definition 1: We say S_k(x) is a spline of order k on knot sequence {x_i},
with m_i = card{x_j : x_j = x_i}, if

1. it is a polynomial of degree (k-1) on each interval between adjacent knots;

2. S_k(x_i) ∈ C^{k-1-m_i}.

The x_i's are called the knots.

A polynomial is simply a spline with knots of multiplicity [...].

Since splines are piecewise polynomials, they can preserve the desirable
characteristics of polynomial approximation while allowing more flexibility.
It has been proposed to use integrals of Walsh functions as basis functions
for decomposing signals [...], but they are instances of spline functions with
knots at appropriate powers of (1/2).
Definition 2: The i-th B-spline of order k, B_{i,k}(x), on knot set {x_j}
is a spline with the additional properties that:

1. B_{i,k}(x) = 0 for x ∉ [x_i, x_{i+k}];

2. B_{i,k}(x) > 0 for x ∈ (x_i, x_{i+k});

3. one of several normalizations holds, the two most common being
   Σ_i B_{i,k}(x) = 1 or ∫ B_{i,k} = 1.

The collection of B-splines forms a basis for the vector space of all splines
of order k on that knot sequence. Further, the above requirements mean that
there are at most k nonzero functions for any value of x. The matrices that
result in applying linear combinations of B-splines to solve least squares or
interpolation problems are then banded of width (2k-1), and hence are more
easily solvable computationally. If the knots are evenly spaced, all the
B_{i,k}(x) are just translations of one fixed B-spline, and they have the
interesting convolutional property that B_{i,k} * B_{p,r} = B_{q,k+r}, and
hence a spline filter acting on a spline signal yields a higher order spline
with known parameters. The ideal low pass filter, the Fourier window, the
triangular window, and the Parzen window are all examples of B-splines
occurring in signal processing, as is any other window or filter that can be
represented as a piecewise polynomial. Indeed, their versatility has caused
them to be used in some Computer Aided Geometric Design systems instead of
rational polynomials.
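The local support and normalization properties can be checked numerically with the Cox-de Boor recursion, a standard way of evaluating B-splines. The text does not say which evaluation scheme was used, so the Python sketch below is an assumption in that respect; the function name `bspline`, the order k = 4 and the unit knot spacing are likewise chosen only for illustration.

```python
def bspline(i, k, x, knots):
    """Value at x of the i-th B-spline of order k (degree k-1) on `knots`,
    computed with the Cox-de Boor recursion (normalized to sum to one)."""
    if k == 1:
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    left_span = knots[i + k - 1] - knots[i]
    right_span = knots[i + k] - knots[i + 1]
    left = 0.0
    if left_span > 0.0:
        left = (x - knots[i]) / left_span * bspline(i, k - 1, x, knots)
    right = 0.0
    if right_span > 0.0:
        right = (knots[i + k] - x) / right_span * bspline(i + 1, k - 1, x, knots)
    return left + right


k = 4                                      # cubic B-splines (order 4)
knots = [float(j) for j in range(12)]      # evenly spaced knots, spacing 1
n_splines = len(knots) - k                 # 8 basis functions

x = 5.3
values = [bspline(i, k, x, knots) for i in range(n_splines)]
nonzero = [i for i, v in enumerate(values) if v > 0.0]   # at most k entries
total = sum(values)                        # equals 1 away from the end knots

# With even spacing, each B-spline is a translate of its neighbour.
shifted = bspline(3, k, x + 1.0, knots)
```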

SPLINE  METHOD 

While the general ideas described here can be applied to a variety of
situations, we shall develop an application here that fits the signal by
fitting real and imaginary parts of the Fourier transform with linear
combinations of B-splines and then resynthesizes the signal using the
inverse transforms of the B-splines. Frequently these inverse transforms can
be tabled so that the inverse transform need not be computed.

Let s(t) be a low pass filtered version of a signal, and let s(t_p, t) =
s(t - t_p) w(t) be the windowed signal it is desired to approximate. For ease
of presentation we call S(w) = F[s(t_p, t)], where F designates the Fourier
transform.

Henceforth the k denoting the order of the spline will be omitted since it is
the same within any application. We wish to fit S(w) in some optimal
manner by a B-spline function given by

$$
\hat{S}(w) =
\begin{cases}
\sum_i c_i B_i(w), & w \ge 0, \\
\sum_i c_i^{*} B_i(-w), & \text{otherwise,}
\end{cases}
\tag{1}
$$

where the c_i = a_i + j b_i are complex valued parameters. We desire to
minimize the error E, defined in the standard least squares sense as follows.

$$E = \int |S(w) - \hat{S}(w)|^{2}\, dw \tag{2}$$

The parameters are determined by minimizing E in (2) with respect to each of
the parameters, which is done by setting

$$\partial E / \partial a_i = \partial E / \partial b_i = 0 \quad \text{for each } i.$$

The resulting linear equations are banded since B-splines have supports which
overlap only partially, i.e.,

$$\int B_i(w)\, B_p(w)\, dw = 0 \quad \text{when } p \ne i-k+1, \ldots, i+k-1.$$

Inversion and solution for the parameters is computationally easier than the
polynomial case using a power basis.
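The banded structure of the least squares equations can be illustrated with a small Python sketch. It builds the matrix of inner products of B-splines on evenly spaced knots by a midpoint-rule sum and checks which entries vanish. The order k = 3, the knot set, the sampling density and the function names are illustrative assumptions, and the Cox-de Boor recursion is used for evaluation as a standard choice rather than as the method of this paper.

```python
def bspline(i, k, x, knots):
    """i-th B-spline of order k on `knots`, by the Cox-de Boor recursion."""
    if k == 1:
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    left_span = knots[i + k - 1] - knots[i]
    right_span = knots[i + k] - knots[i + 1]
    left = 0.0
    if left_span > 0.0:
        left = (x - knots[i]) / left_span * bspline(i, k - 1, x, knots)
    right = 0.0
    if right_span > 0.0:
        right = (knots[i + k] - x) / right_span * bspline(i + 1, k - 1, x, knots)
    return left + right


def gram_matrix(k, knots, samples_per_interval=50):
    """Approximate the integrals of B_i * B_p with a midpoint-rule sum."""
    n = len(knots) - k
    count = samples_per_interval * (len(knots) - 1)
    step = (knots[-1] - knots[0]) / count
    grid = [knots[0] + (j + 0.5) * step for j in range(count)]
    basis = [[bspline(i, k, w, knots) for w in grid] for i in range(n)]
    return [[sum(u * v for u, v in zip(basis[i], basis[p])) * step
             for p in range(n)] for i in range(n)]


k = 3
knots = [float(j) for j in range(10)]
gram = gram_matrix(k, knots)
n = len(gram)

# Entries vanish whenever the two supports do not overlap, |i - p| >= k,
# leaving at most 2k - 1 nonzero diagonals.
banded = all(gram[i][p] == 0.0
             for i in range(n) for p in range(n) if abs(i - p) >= k)
```

A banded symmetric system of this kind can be solved in time proportional to the number of parameters, which is the computational advantage over a power basis noted above.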


We next determine the equations of the time waveform corresponding to this
method of fitting the spectrum. While the inverse transform of the general
basis function is rather complicated, we can develop a formulation for
specific instances.

Let the knots w_i, ..., w_{i+k} be evenly spaced with spacing D_i and

$$
A_i(w) =
\begin{cases}
c_i B_i(w), & w \ge 0, \\
c_i^{*} B_i(-w), & \text{otherwise.}
\end{cases}
$$

Then, using the convolutional property of uniformly spaced B-splines and the
convolutional property of Fourier transforms yields

$$
F^{-1}[A_i(w)] = (D_i/2)\, K \left[ \frac{\sin(D_i t/2)}{D_i t/2} \right]^{k}
(a_i \cos w_{i+2} t + b_i \sin w_{i+2} t). \tag{3}
$$

Thus, if all the knots have spacing D the estimate of s(t_p, t) is

$$
K\, (D/2) \left[ \frac{\sin(D t/2)}{D t/2} \right]^{k}
\sum_i \left[ a_i \cos (i+2) D t + b_i \sin (i+2) D t \right].
$$
A decaying high order trigonometric polynomial with fundamental frequency D is
the resulting signal. More interesting cases occur when the spacing is
nonuniform. This feature allows a much closer fit in frequency ranges
selectively determined to be more important to the intelligibility of the
signal, and has some resemblances to selective LPC while ensuring the degree
of continuity desired. The simplest case uses sections of uniformly spaced
knots with different spacing in each section. The inverse transforms of the
transition A(w)'s will not have the simple form derived above, but the
contributions of the others to the synthetic waveform are sums of decaying
trigonometric polynomials with different fundamental frequencies and different
rates of decay. It is postulated that these parametrized waveforms carry
signal information in a form faithful to the original.

APPLICATIONS 


This general class of methods has not yet been widely tested or developed.
However, it has been applied to a variety of selected speech signals to test
for intelligibility and faithfulness in the presence of multiple speakers,
female speakers, and noise at various levels, as well as on clear speech.
Figures 1-4 illustrate a sequence from a voiced signal sampled at 10 kHz. In
each of the figures the one on the left is from the original signal, while
that on the right is from the synthetic signal. Further testing to determine
good knot locations and the number of parameters desirable for various
applications seems worthwhile, including gaining further information about the
phase of the signal.